The uniting theme between these projects is visualizing information—from/about a data analysis—that is often overlooked in research dissemination, despite its relevance to our understanding of the analysis.


smallsets: Visual documentation for data preprocessing

The smallsets R package generates visual documentation for data preprocessing in R, R Markdown, Python, and Jupyter Notebooks. Users add structured comments to their preprocessing code and pass it to smallsets. The output from smallsets is a Smallset Timeline, a static and compact visualisation providing a step-by-step explanation of preprocessing decisions. It uses a small set (“Smallset”) of rows from the original dataset to visualize the process at a manageable scale.

smallsets website: lydialucchesi.github.io/smallsets/
Smallset Timeline paper: doi.org/10.1145/3531146.3533175

Below is an example Smallset Timeline visualizing the preprocessing of a synthetic dataset with 100 rows and eight columns (C1-C8). It uses a Smallset of five rows and is composed of three Smallset snapshots showing three preprocessing steps.




Vizumap: Visualizing uncertainty on maps

The Vizumap R package offers four methods for mapping statistical estimates and uncertainties simultaneously. Users can build a bivariate map, pixel map (with optional animated flickering), glyph map, or exceedance probability map. Vizumap has been used in ecology and public health research.

Below is an example pixel map that I built with Vizumap. Pixel maps divide regions into pixels, and the pixels are filled with values sampled from the confidence interval for a region’s estimate. The U.S. equal area cartogram below visualizes 2020 election forecasts from FiveThirtyEight.1 The pixel values are uniformly sampled from the 80% confidence interval for each state’s forecasted vote share for Biden. States with wider confidence intervals appear more pixelated because the confidence interval spans more of the color gradient. Conversely, states with narrower confidence intervals appear less pixelated.

     

     

1 I used variables voteshare_chal (vote share), voteshare_chal_lo (80% CI lower bound), and voteshare_chal_hi (80% CI upper bound) from the file presidential_state_toplines_2020.csv available in the FiveThirtyEight GitHub repository.